Data Description

The dataset contains transactions made with credit cards in September 2013 by European cardholders.

It covers transactions that occurred over two days and includes 492 frauds out of 284,807 transactions.

The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
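The imbalance figure can be checked directly from the counts quoted above. The short sketch below also illustrates why accuracy is a weak metric for this data: a hypothetical classifier that never predicts fraud (an assumption for illustration, not a model used in this notebook) still scores very high accuracy while catching no fraud at all.

```python
# Counts quoted in the data description.
n_transactions = 284_807
n_frauds = 492

fraud_pct = 100 * n_frauds / n_transactions

# Hypothetical "always predict non-fraud" classifier.
always_negative_accuracy = 100 * (n_transactions - n_frauds) / n_transactions
always_negative_recall = 0 / n_frauds  # no fraud is ever flagged

print(f'Fraud share          : {fraud_pct:.4f}%')
print(f'All-negative accuracy: {always_negative_accuracy:.4f}%')
print(f'All-negative recall  : {always_negative_recall:.4f}')
```

This is the motivation for judging models by recall rather than accuracy.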

It contains only numerical input variables which are the result of a PCA transformation.

Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data.

Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'.

Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; it can be used, for example, for example-dependent cost-sensitive learning.
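To make "example-dependent cost" concrete, here is a minimal sketch in which the cost of an error depends on the transaction itself. The cost rules and the `admin_cost` value are assumptions chosen for illustration; the notebook itself does not define a cost model, and `example_dependent_cost` is a hypothetical helper.

```python
def example_dependent_cost(y_true, y_pred, amounts, admin_cost=5.0):
    """Total cost of a set of predictions.

    Assumptions (hypothetical, for illustration only):
    - a missed fraud (false negative) costs the transaction amount
    - every flagged transaction costs a fixed admin_cost to review
    - correctly ignored legitimate transactions cost nothing
    """
    total = 0.0
    for truth, pred, amount in zip(y_true, y_pred, amounts):
        if pred == 1:
            total += admin_cost      # review cost for any flagged transaction
        elif truth == 1:
            total += amount          # fraud that slipped through
    return total


y_true = [0, 1, 1, 0]
y_pred = [0, 0, 1, 1]
amounts = [10.0, 200.0, 50.0, 30.0]

total_cost = example_dependent_cost(y_true, y_pred, amounts)
print(total_cost)
```

Under these assumptions a missed large fraud is far more expensive than a false alarm, which is the intuition behind weighting examples by 'Amount'.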

Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

Business Problem

Task    : Detect the fraudulent activities.
Metric  : Recall
Sampling: No sampling, use all the data.
Tools   : Use the Python module PyCaret for classification.
Question: How many frauds are correctly classified?

Introduction to Boosting

The term boosting refers to a family of algorithms that convert weak learners into strong learners.
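The idea can be sketched with a toy gradient-boosting loop in plain Python. This is an illustrative sketch under simplifying assumptions (squared loss, one-dimensional input, one-split "stumps" as weak learners); it is not how CatBoost, XGBoost, or LightGBM are implemented, and all function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Weak learner: the one-split regression stump with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the largest x would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        sse = (sum((r - left_value) ** 2 for r in left)
               + sum((r - right_value) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Each round fits a stump to the current residuals and adds a damped copy of it."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 1, 1, 0, 1, 1]

base, stumps, boosted_preds = boost(xs, ys)
mse_baseline = mse(ys, [base] * len(ys))
mse_boosted = mse(ys, boosted_preds)
print(f'MSE of constant prediction: {mse_baseline:.4f}')
print(f'MSE after boosting        : {mse_boosted:.4f}')
```

A single stump cannot represent this target, but the sum of many damped stumps fits it progressively better, which is the sense in which weak learners are combined into a strong one.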

There are many boosting algorithms:

sklearn.ensemble.GradientBoostingRegressor
xgboost.XGBRegressor # fast and best
lightgbm.LGBMRegressor # extremely fast, slightly less accurate than xgboost
catboost.CatBoostRegressor # good for categorical features

Colab

In [49]:
%%capture
import sys
ENV_COLAB = 'google.colab' in sys.modules

if ENV_COLAB:
    #!pip install hpsklearn
    !pip install shap eli5 lime scikit-plot watermark
    !pip install optuna hyperopt
    !pip install catboost
    !pip install ipywidgets
    !pip install -U scikit-learn
    !jupyter nbextension enable --py widgetsnbextension

    # create project like folders
    !mkdir -p ../outputs ../images ../reports ../html ../models

    print('Environment: Google Colab')

Imports

In [14]:
import time

notebook_start_time = time.time()
In [15]:
import numpy as np
import pandas as pd

SEED = 100

# visualization
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = 8,8
plt.rcParams.update({'font.size': 16})
plt.style.use('ggplot')
%matplotlib inline
import seaborn as sns
sns.set(color_codes=True)

# six and pickle
import six
import pickle
import joblib

# mixed
import copy
import pprint
pp = pprint.PrettyPrinter(indent=4)

# sklearn
import sklearn

# classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# scale and split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold

# sklearn scalar metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

# roc auc and curves
from sklearn.metrics import auc
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import precision_recall_curve

# confusion matrix and classification report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# boosting
import xgboost, lightgbm, catboost
import xgboost as xgb
import lightgbm as lgb
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBClassifier, DMatrix
from lightgbm import LGBMClassifier, Dataset
from catboost import CatBoostClassifier, Pool, CatBoost

# parameters tuning
from hyperopt import hp, tpe, fmin, Trials, STATUS_OK, STATUS_FAIL
from hyperopt.pyll import scope
from hyperopt.pyll.stochastic import sample

# model interpretation modules
import eli5
import shap
import yellowbrick
import lime
import scikitplot

# version
%load_ext watermark
%watermark -a "Bhishan Poudel" -d -v -m
print()
%watermark -iv
Bhishan Poudel 2020-10-05 

CPython 3.6.9
IPython 5.5.0

compiler   : GCC 8.4.0
system     : Linux
release    : 4.19.112+
machine    : x86_64
processor  : x86_64
CPU cores  : 2
interpreter: 64bit

sklearn     0.22.2.post1
numpy       1.18.5
xgboost     0.90
catboost    0.24.1
eli5        0.10.1
six         1.15.0
yellowbrick 0.9.1
scikitplot  0.3.7
seaborn     0.11.0
pandas      1.1.2
lightgbm    2.2.3
joblib      0.16.0
shap        0.36.0

Useful Scripts

In [16]:
def show_method_attributes(obj, ncols=7,start=None, inside=None):
    """ Show all the attributes of a given method.
    Example:
    ========
    show_method_attributes(list)
     """
    lst = [elem for elem in dir(obj) if elem[0]!='_' ]
    lst = [elem for elem in lst 
           if elem not in 'os np pd sys time psycopg2'.split() ]

    if isinstance(start,str):
        lst = [elem for elem in lst if elem.startswith(start)]
        
    if isinstance(start,tuple) or isinstance(start,list):
        lst = [elem for elem in lst for start_elem in start
               if elem.startswith(start_elem)]
        
    if isinstance(inside,str):
        lst = [elem for elem in lst if inside in elem]
        
    if isinstance(inside,tuple) or isinstance(inside,list):
        lst = [elem for elem in lst for inside_elem in inside
               if inside_elem in elem]

    return pd.DataFrame(np.array_split(lst,ncols)).T.fillna('')
In [112]:
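def model_evaluation(model_name, desc, ytest, ypreds,df_eval=None,
                     show=True,sort_col='Recall'):
    """Append one row of metrics for a model to df_eval and return it.

    Note: AUC is computed from hard class labels (ypreds), not probabilities.
    """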
    if df_eval is None:
        df_eval = pd.DataFrame({'Model': [],
                        'Description':[],
                        'Accuracy':[],
                        'Precision':[],
                        'Recall':[],
                        'F1':[],
                        'AUC':[],
                    })

    # model evaluation
    average = 'binary'
    row_eval = [model_name,desc, 
                sklearn.metrics.accuracy_score(ytest, ypreds),
                sklearn.metrics.precision_score(ytest, ypreds, average=average),
                sklearn.metrics.recall_score(ytest, ypreds, average=average),
                sklearn.metrics.f1_score(ytest, ypreds, average=average),
                sklearn.metrics.roc_auc_score(ytest, ypreds),
                ]

    df_eval.loc[len(df_eval)] = row_eval
    df_eval = df_eval.drop_duplicates()
    df_eval = df_eval.sort_values(sort_col,ascending=False)

    if show:
        display(df_eval.style.background_gradient(subset=[sort_col]))

    return df_eval

df_eval = None
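Because `model_evaluation` passes hard 0/1 predictions to `roc_auc_score`, the AUC column it reports reduces to the mean of recall and specificity (balanced accuracy). The sketch below checks this against the confusion matrix reported later in this notebook for the default CatBoost model; the variable names are illustrative.

```python
# Confusion matrix reported for the default CatBoost model in this notebook:
# rows are true classes (0, 1), columns are predicted classes (0, 1).
tn, fp = 56864, 0
fn, tp = 21, 77

recall = tp / (tp + fn)        # true positive rate
specificity = tn / (tn + fp)   # true negative rate

# With hard 0/1 predictions the ROC curve has a single interior point,
# so its area is the mean of recall and specificity.
auc_from_labels = (recall + specificity) / 2
print(f'{auc_from_labels:.6f}')
```

If a threshold-independent AUC is wanted instead, one option would be to pass the positive-class column of `predict_proba` to `roc_auc_score`; that is a suggestion, not what the notebook does.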

Load the data

In [19]:
ifile = 'https://github.com/bhishanpdl/Datasets/blob/master/Projects/Fraud_detection/raw/creditcard.csv.zip?raw=true'
df = pd.read_csv(ifile,compression='zip')
print(df.shape)
df.head()
(284807, 31)
Out[19]:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 0.090794 -0.551600 -0.617801 -0.991390 -0.311169 1.468177 -0.470401 0.207971 0.025791 0.403993 0.251412 -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 -0.166974 1.612727 1.065235 0.489095 -0.143772 0.635558 0.463917 -0.114805 -0.183361 -0.145783 -0.069083 -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 0.207643 0.624501 0.066084 0.717293 -0.165946 2.345865 -2.890083 1.109969 -0.121359 -2.261857 0.524980 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 -0.054952 -0.226487 0.178228 0.507757 -0.287924 -0.631418 -1.059647 -0.684093 1.965775 -1.232622 -0.208038 -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 0.753074 -0.822843 0.538196 1.345852 -1.119670 0.175121 -0.451449 -0.237033 -0.038195 0.803487 0.408542 -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99 0
In [20]:
target = 'Class'
features = df.columns.drop(target)
df[target].value_counts(normalize=True)*100
Out[20]:
0    99.827251
1     0.172749
Name: Class, dtype: float64
In [82]:
sns.countplot(x=df[target])
Out[82]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f67d3bc8f28>

Train test split with stratify

In [21]:
from sklearn.model_selection import train_test_split

df_Xtrain_orig, df_Xtest, ser_ytrain_orig, ser_ytest = train_test_split(
    df.drop(target,axis=1), 
    df[target],
    test_size=0.2, 
    random_state=SEED, 
    stratify=df[target])

ytrain_orig = ser_ytrain_orig.to_numpy().ravel()
ytest = ser_ytest.to_numpy().ravel()

print(df_Xtrain_orig.shape)
df_Xtrain_orig.head()
(227845, 30)
Out[21]:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 Amount
211885 138616.0 -1.137612 2.345154 -1.767247 0.833982 0.973168 -0.073571 0.802433 0.733137 -1.154087 -0.520340 0.494117 0.799935 0.494576 -0.479666 -0.917177 -0.184117 1.189459 0.937244 0.960749 0.062820 0.114953 0.430613 -0.240819 0.124011 0.187187 -0.402251 0.196277 0.190732 39.46
12542 21953.0 -1.028649 1.141569 2.492561 -0.242233 0.452842 -0.384273 1.256026 -0.816401 1.964560 -0.014216 0.432153 -2.140921 2.274477 0.114128 -1.652894 -0.617302 0.243791 -0.426168 -0.493177 0.350032 -0.380356 -0.037432 -0.503934 0.407129 0.604252 0.233015 -0.433132 -0.491892 7.19
270932 164333.0 -1.121864 -0.195099 1.282634 -3.172847 -0.761969 -0.287013 -0.586367 0.496182 -2.352349 0.350551 -1.319688 -0.942001 1.082210 -0.425735 0.036748 0.380392 -0.033353 0.204609 -0.801465 -0.113632 -0.328953 -0.856937 -0.056198 0.401905 0.406813 -0.440140 0.152356 0.030128 40.00
30330 35874.0 1.094238 -0.760568 -0.392822 -0.611720 -0.722850 -0.851978 -0.185505 -0.095131 -1.122304 0.367009 1.378493 -0.724216 -1.105406 -0.480170 0.220826 1.745743 0.740817 -0.728827 1.016740 0.354148 -0.227392 -1.254285 0.022116 -0.141531 0.114515 -0.652427 -0.037897 0.051254 165.85
272477 165107.0 2.278095 -1.298924 -1.884035 -1.530435 -0.649500 -0.996024 -0.466776 -0.438025 -1.612665 1.631133 -1.126000 -0.938760 0.300621 -0.119667 -0.585453 -1.106244 0.690235 -0.124401 -0.075649 -0.341708 0.123892 0.815909 -0.072537 0.784217 0.403428 0.193747 -0.043185 -0.058719 60.00
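What `stratify` buys us can be illustrated with a small plain-Python sketch: each class is split separately, so the rare class keeps the same share in both parts. This is a simplified illustration, not scikit-learn's implementation; `stratified_split` and the toy labels are hypothetical.

```python
import random


def stratified_split(labels, test_fraction, seed):
    """Split indices so that each class keeps the same share in train and test."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy labels: 1% positives, loosely mimicking a rare fraud class.
labels = [1] * 10 + [0] * 990
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=100)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

With only 492 frauds in the real data, an unstratified split could leave the test set with noticeably more or fewer frauds than expected, which would make recall estimates less stable.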

Train Validation with stratify

In [22]:
df_Xtrain, df_Xvalid, ser_ytrain, ser_yvalid = train_test_split(
    df_Xtrain_orig, 
    ser_ytrain_orig,
    test_size=0.2, 
    random_state=SEED, 
    stratify=ser_ytrain_orig)

ytrain = ser_ytrain.to_numpy().ravel()
yvalid = ser_yvalid.to_numpy().ravel()

print(df_Xtrain.shape)
(182276, 30)

Modelling catboost

https://catboost.ai/docs/concepts/python-reference_catboostregressor.html

class CatBoostRegressor(

iterations=None,                 learning_rate=None,
depth=None,                      l2_leaf_reg=None,
model_size_reg=None,             rsm=None,
loss_function='RMSE',            border_count=None,
feature_border_type=None,        per_float_feature_quantization=None,
input_borders=None,              output_borders=None,
fold_permutation_block=None,     od_pval=None,
od_wait=None,                    od_type=None,
nan_mode=None,                   counter_calc_method=None,
leaf_estimation_iterations=None, leaf_estimation_method=None,
thread_count=None,               random_seed=None,
use_best_model=None,             best_model_min_trees=None,
verbose=None,                    silent=None,
logging_level=None,              metric_period=None,
ctr_leaf_count_limit=None,       store_all_simple_ctr=None,
max_ctr_complexity=None,         has_time=None,
allow_const_label=None,          one_hot_max_size=None,
random_strength=None,name=None,  ignored_features=None,
train_dir=None,                  custom_metric=None,
eval_metric=None,                bagging_temperature=None,
save_snapshot=None,              snapshot_file=None,
snapshot_interval=None,          fold_len_multiplier=None,
used_ram_limit=None,             gpu_ram_part=None,
pinned_memory_size=None,         allow_writing_files=None,
final_ctr_computation_mode=None, approx_on_full_history=None,
boosting_type=None,              simple_ctr=None,
combinations_ctr=None,           per_feature_ctr=None,
ctr_target_border_count=None,    task_type=None,
device_config=None,              devices=None,
bootstrap_type=None,             subsample=None,
sampling_unit=None,              dev_score_calc_obj_block_size=None,
max_depth=None,                  n_estimators=None,
num_boost_round=None,            num_trees=None,
colsample_bylevel=None,          random_state=None,
reg_lambda=None,                 objective=None,
eta=None,                        max_bin=None,
gpu_cat_features_storage=None,   data_partition=None,
metadata=None,                   early_stopping_rounds=None,
cat_features=None,               grow_policy=None,
min_data_in_leaf=None,           min_child_samples=None,
max_leaves=None,                 num_leaves=None,
score_function=None,             leaf_estimation_backtracking=None,
ctr_history_unit=None,           monotone_constraints=None
)
In [24]:
import catboost
show_method_attributes(catboost,2)
Out[24]:
0 1
0 CatBoost Pool
1 CatBoostClassifier core
2 CatBoostError cv
3 CatBoostRegressor sum_models
4 CatboostError to_classifier
5 EFstrType to_regressor
6 FeaturesData train
7 MetricVisualizer version
8 MultiRegressionCustomMetric widget
9 MultiRegressionCustomObjective
In [25]:
from catboost import CatBoostClassifier, Pool

show_method_attributes(CatBoostClassifier,2)
Out[25]:
0 1
0 best_iteration_ get_test_evals
1 best_score_ get_text_feature_indices
2 calc_feature_statistics get_tree_leaf_counts
3 calc_leaf_indexes grid_search
4 classes_ is_fitted
5 compare iterate_leaf_indexes
6 copy learning_rate_
7 create_metric_calcer load_model
8 drop_unused_features n_features_in_
9 eval_metrics plot_partial_dependence
10 evals_result_ plot_predictions
11 feature_importances_ plot_tree
12 feature_names_ predict
13 fit predict_log_proba
14 get_all_params predict_proba
15 get_best_iteration random_seed_
16 get_best_score randomized_search
17 get_borders save_borders
18 get_cat_feature_indices save_model
19 get_embedding_feature_indices score
20 get_evals_result set_feature_names
21 get_feature_importance set_leaf_values
22 get_leaf_values set_params
23 get_leaf_weights set_scale_and_bias
24 get_metadata shrink
25 get_n_features_in staged_predict
26 get_object_importance staged_predict_log_proba
27 get_param staged_predict_proba
28 get_params tree_count_
29 get_scale_and_bias virtual_ensembles_predict
30 get_test_eval
In [87]:
from catboost import CatBoostClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score,  precision_score, recall_score,f1_score
from sklearn.metrics import confusion_matrix

# time
time_start = time.time()

# current parameters
desc = 'default,random_state=100, cross_validation_ypreds'

Xtr = df_Xtrain.to_numpy()
ytr = ser_ytrain.to_numpy().ravel()
Xtx = df_Xtest.to_numpy()
ytx = ser_ytest.to_numpy().ravel()

# fit the model
model = CatBoostClassifier(verbose=100,random_state=SEED)
model.fit(Xtr, ytr)

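# save the model
# joblib.dump(model, 'model_cat.pkl')
# model = joblib.load('model_cat.pkl')

# predictions
# NOTE: cross_val_predict clones the estimator and refits it on folds of the
# test set, so these are out-of-fold predictions from models trained on test
# data; the model fitted above is not used here.
skf = StratifiedKFold(n_splits=2,shuffle=True,random_state=SEED)
ypreds_cv = cross_val_predict(model, Xtx, ytx, cv=skf)
ypreds = ypreds_cv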

# model evaluation
df_eval = model_evaluation('catboost', desc, ytx,ypreds,df_eval=df_eval)

time_taken = time.time() - time_start
print('Time taken: {:.0f} min {:.0f} secs'.format(*divmod(time_taken,60)))
display(df_eval)
Learning rate set to 0.095119
0:	learn: 0.4064239	total: 82.3ms	remaining: 1m 22s
100:	learn: 0.0014365	total: 7.97s	remaining: 1m 10s
200:	learn: 0.0009553	total: 15.8s	remaining: 1m 2s
300:	learn: 0.0006789	total: 23.9s	remaining: 55.4s
400:	learn: 0.0004701	total: 31.7s	remaining: 47.4s
500:	learn: 0.0003296	total: 39.7s	remaining: 39.5s
600:	learn: 0.0002371	total: 47.8s	remaining: 31.7s
700:	learn: 0.0001719	total: 55.6s	remaining: 23.7s
800:	learn: 0.0001354	total: 1m 3s	remaining: 15.7s
900:	learn: 0.0001058	total: 1m 11s	remaining: 7.83s
999:	learn: 0.0000872	total: 1m 19s	remaining: 0us
Learning rate set to 0.043056
0:	learn: 0.5635981	total: 26.5ms	remaining: 26.5s
100:	learn: 0.0017777	total: 2.33s	remaining: 20.7s
200:	learn: 0.0007480	total: 4.63s	remaining: 18.4s
300:	learn: 0.0003691	total: 6.92s	remaining: 16.1s
400:	learn: 0.0002480	total: 9.17s	remaining: 13.7s
500:	learn: 0.0001734	total: 11.5s	remaining: 11.4s
600:	learn: 0.0001340	total: 13.7s	remaining: 9.09s
700:	learn: 0.0001134	total: 15.9s	remaining: 6.79s
800:	learn: 0.0000977	total: 18.2s	remaining: 4.51s
900:	learn: 0.0000858	total: 20.4s	remaining: 2.24s
999:	learn: 0.0000764	total: 22.7s	remaining: 0us
Learning rate set to 0.043056
0:	learn: 0.5628806	total: 25.7ms	remaining: 25.6s
100:	learn: 0.0019698	total: 2.34s	remaining: 20.8s
200:	learn: 0.0009914	total: 4.66s	remaining: 18.5s
300:	learn: 0.0005191	total: 6.99s	remaining: 16.2s
400:	learn: 0.0003353	total: 9.31s	remaining: 13.9s
500:	learn: 0.0002580	total: 11.6s	remaining: 11.6s
600:	learn: 0.0002056	total: 13.9s	remaining: 9.22s
700:	learn: 0.0001702	total: 16.2s	remaining: 6.91s
800:	learn: 0.0001445	total: 18.5s	remaining: 4.59s
900:	learn: 0.0001232	total: 20.8s	remaining: 2.29s
999:	learn: 0.0001062	total: 23.1s	remaining: 0us
Model Description Accuracy Precision Recall F1 AUC
0 catboost default,random_state=100, numpy 0.999456 0.913580 0.755102 0.826816 0.877489
Time taken: 2 min 7 secs
Model Description Accuracy Precision Recall F1 AUC
0 catboost default,random_state=100, numpy 0.999456 0.91358 0.755102 0.826816 0.877489
In [109]:
%%time
model = CatBoostClassifier(verbose=100,random_state=SEED)
model.fit(Xtr, ytr)
joblib.dump(model, '../models/model_cat_default_seed100.joblib')

ypreds = model.predict(Xtx)
cm = sklearn.metrics.confusion_matrix(ytx,ypreds)
print('confusion matrix\n',cm)

desc = 'default, seed=100'
df_eval = model_evaluation('catboost', desc, ytx,ypreds,df_eval=df_eval)
[0 0 0 0 0]
[56885    77]
Model Description Accuracy Precision Recall F1 AUC
0 catboost default,random_state=100, numpy 0.999456 0.913580 0.755102 0.826816 0.877489
1 catboost early stopping, iterations=885 0.999579 0.986842 0.765306 0.862069 0.882644
2 catboost grid search optuna 0.999579 0.974359 0.775510 0.863636 0.887738
3 catboost default, seed=100 0.999631 1.000000 0.785714 0.880000 0.892857
CPU times: user 227 ms, sys: 8.09 ms, total: 235 ms
Wall time: 179 ms
In [89]:
yprobs = model.predict_proba(Xtx)
print(yprobs[:5])
[[9.99972750e-01 2.72503683e-05]
 [9.99996518e-01 3.48188471e-06]
 [9.99998383e-01 1.61719094e-06]
 [9.99995585e-01 4.41504877e-06]
 [9.99989143e-01 1.08570064e-05]]
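Since the target metric is recall, the second column of `predict_proba` (the fraud probability) can also be thresholded below the default 0.5 to trade precision for recall. The sketch below uses made-up probabilities and hypothetical helper names; it only illustrates the mechanism and says nothing about which threshold suits this dataset.

```python
def predict_with_threshold(fraud_probs, threshold=0.5):
    """Label a transaction as fraud when its fraud probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in fraud_probs]


def recall_and_precision(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tp / (tp + fp)


# Made-up labels and fraud probabilities for illustration.
y_true = [0, 0, 1, 1, 1, 0]
fraud_probs = [0.01, 0.30, 0.45, 0.80, 0.20, 0.05]

recall_default, precision_default = recall_and_precision(
    y_true, predict_with_threshold(fraud_probs, threshold=0.5))
recall_low, precision_low = recall_and_precision(
    y_true, predict_with_threshold(fraud_probs, threshold=0.2))

print(recall_default, precision_default)
print(recall_low, precision_low)
```

Lowering the threshold catches more frauds in this toy example at the price of an extra false alarm; any threshold should be chosen on validation data rather than on the test set.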
In [93]:
from scikitplot import metrics as skpmetrics

skpmetrics.plot_confusion_matrix(ytx, ypreds)
Out[93]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f67d0169ba8>
In [95]:
fig, ax = plt.subplots(figsize=(12,8))
skpmetrics.plot_roc(ytx,yprobs,ax=ax)
Out[95]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f67d003ff28>
In [28]:
import eli5

# eli5.explain_weights_catboost(model) # same thing
eli5.show_weights(model)
Out[28]:
Weight Feature
0.0776 4
0.0752 1
0.0671 14
0.0566 0
0.0495 8
0.0462 9
0.0451 26
0.0430 12
0.0385 2
0.0378 29
0.0346 10
0.0323 19
0.0318 24
0.0297 6
0.0281 11
0.0280 28
0.0278 13
0.0243 25
0.0237 15
0.0236 18
… 10 more …

Catboost with validation set

In [29]:
df_Xtrain.head(2)
Out[29]:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 Amount
35574 38177.0 1.430419 -0.718078 0.364706 -0.744257 -0.556090 0.698948 -0.949852 0.131008 -0.314353 0.512322 -0.202255 0.766648 1.495082 -1.037475 -1.935434 0.897715 0.069580 -0.902556 1.843867 0.158424 0.042013 0.429576 -0.301931 -0.933773 0.840490 -0.027776 0.044688 -0.007522 0.2
46862 42959.0 -2.425523 -1.790293 2.522139 0.581141 0.918453 0.594426 0.224541 0.373885 -0.168411 -0.720421 1.394710 1.136436 0.508455 -0.389067 -0.165166 -0.040520 -0.464966 -0.057803 -1.493635 0.984535 0.538438 0.877560 0.590595 -0.293545 0.524022 -0.328189 -0.205285 -0.109163 300.0
In [107]:
# # time
# time_start = time.time()

# # current parameters
# Xtr = df_Xtrain
# ytr = ser_ytrain.to_numpy().ravel()
# Xtx = df_Xtest
# ytx = ser_ytest.to_numpy().ravel()
# Xvd = df_Xvalid
# yvd = ser_yvalid.to_numpy().ravel()

# # fit the model
# model = CatBoostClassifier(random_state=0,verbose=100)
# model.fit(Xtr, ytr,
#           eval_set=(Xvd, yvd))

# # ypreds
# ypreds = model.predict(Xtx)

# # r-squared values
# auc = roc_auc_score(ytx, ypreds)

# # time
# time_taken = time.time() - time_start
# print('Time taken: {:.0f} min {:.0f} secs'.format(*divmod(time_taken,60)))
# print('ROC AUC Score ', auc)
In [31]:
# float feature
feature_name = 'Amount'
dict_stats = model.calc_feature_statistics(df_Xtrain, ser_ytrain, feature_name, plot=True)

Feature Importance

In [32]:
# feature importance
df_imp = pd.DataFrame({'Feature': features,
                       'Importance': model.feature_importances_
                       }) 

df_imp.sort_values('Importance',ascending=False).style.background_gradient()
Out[32]:
Feature Importance
4 V4 9.231178
1 V1 8.884553
12 V12 7.858850
14 V14 6.993834
8 V8 5.448198
0 Time 5.067847
26 V26 4.613358
11 V11 3.664417
16 V16 3.448854
6 V6 3.389446
7 V7 3.175937
29 Amount 3.115607
18 V18 3.041911
10 V10 2.904488
17 V17 2.794143
25 V25 2.587967
27 V27 2.485769
19 V19 2.457251
15 V15 2.322782
2 V2 2.310883
20 V20 2.079404
13 V13 1.815875
28 V28 1.786981
3 V3 1.770443
24 V24 1.469989
22 V22 1.362665
9 V9 1.172512
5 V5 1.104829
23 V23 0.909606
21 V21 0.730423
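The importances in the table above sum to 100, so they can be read as percentage shares. As a small sketch, the snippet below copies the reported values and counts how many of the top-ranked features are needed to cover half of the total importance; variable names are illustrative.

```python
# (feature, importance) pairs copied from the table above, already sorted.
importances = [
    ('V4', 9.231178), ('V1', 8.884553), ('V12', 7.858850), ('V14', 6.993834),
    ('V8', 5.448198), ('Time', 5.067847), ('V26', 4.613358), ('V11', 3.664417),
    ('V16', 3.448854), ('V6', 3.389446), ('V7', 3.175937), ('Amount', 3.115607),
    ('V18', 3.041911), ('V10', 2.904488), ('V17', 2.794143), ('V25', 2.587967),
    ('V27', 2.485769), ('V19', 2.457251), ('V15', 2.322782), ('V2', 2.310883),
    ('V20', 2.079404), ('V13', 1.815875), ('V28', 1.786981), ('V3', 1.770443),
    ('V24', 1.469989), ('V22', 1.362665), ('V9', 1.172512), ('V5', 1.104829),
    ('V23', 0.909606), ('V21', 0.730423),
]

total = sum(value for _, value in importances)

# Walk down the ranking until half of the total importance is covered.
cumulative = 0.0
n_features_for_half = None
for rank, (name, value) in enumerate(importances, start=1):
    cumulative += value
    if cumulative >= total / 2:
        n_features_for_half = rank
        break

print(f'Total importance: {total:.3f}')
print(f'Features needed for half of it: {n_features_for_half}')
```

The importance is spread fairly evenly across features here, which is not surprising for PCA components.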
In [33]:
def plot_feature_imp_catboost(model_catboost,features):
    """Plot the feature importance horizontal bar plot.
    
    """

    df_imp = pd.DataFrame({'Feature': model_catboost.feature_names_,
                        'Importance': model_catboost.feature_importances_
                        }) 

    df_imp = df_imp.sort_values('Importance').set_index('Feature')
    ax = df_imp.plot.barh(figsize=(12,8))

    plt.grid(True)
    plt.title('Feature Importance',fontsize=14)
    ax.get_legend().remove()

    for p in ax.patches:
        x = p.get_width()
        y = p.get_y()
        text = '{:.2f}'.format(p.get_width())
        ax.text(x, y,text,fontsize=15,color='indigo')

    plt.show()

plot_feature_imp_catboost(model, features)
In [34]:
df_fimp = model.get_feature_importance(prettified=True)
df_fimp.head()
Out[34]:
Feature Id Importances
0 V4 9.231178
1 V1 8.884553
2 V12 7.858850
3 V14 6.993834
4 V8 5.448198
In [35]:
plt.figure(figsize=(12,8))
ax = sns.barplot(x=df_fimp.columns[1], y=df_fimp.columns[0], data=df_fimp);

for p in ax.patches:
    x = p.get_width()
    y = p.get_y()
    text = '{:.2f}'.format(p.get_width())
    ax.text(x, y,text,fontsize=15,color='indigo',va='top',ha='left')

catboost using Pool

In [36]:
from catboost import CatBoost, Pool
In [37]:
# help(CatBoost)
In [38]:
cat_features = [] # take it empty for the moment
dtrain = Pool(df_Xtrain, ser_ytrain, cat_features=cat_features)
dvalid = Pool(df_Xvalid, ser_yvalid, cat_features=cat_features)
dtest = Pool(df_Xtest, ser_ytest, cat_features=cat_features)
In [39]:
params_cat = {'iterations': 100,
          'random_seed': 0,
          'eval_metric':'AUC',
          'loss_function':'Logloss',
          'cat_features': [],
          'ignored_features': [],
          'early_stopping_rounds': 200,
          'verbose': 200,
          }

bst_cat = CatBoost(params=params_cat)

bst_cat.fit(dtrain,           
            eval_set=(df_Xvalid, ser_yvalid), 
          use_best_model=True,
          plot=True);

print(bst_cat.eval_metrics(dtest, ['AUC'])['AUC'][-1])
Learning rate set to 0.312111
0:	test: 0.9426860	best: 0.9426860 (0)	total: 92.2ms	remaining: 9.13s
99:	test: 0.9732950	best: 0.9804994 (14)	total: 8.56s	remaining: 0us

bestTest = 0.9804994003
bestIteration = 14

Shrink model to first 15 iterations.
0.9632516501958127

Cross Validation

cv(pool=None, params=None, dtrain=None, iterations=None, 
num_boost_round=None, fold_count=None, nfold=None, inverted=False,
partition_random_seed=0, seed=None, shuffle=True, logging_level=None,
stratified=None, as_pandas=True, metric_period=None, verbose=None,
verbose_eval=None, plot=False, early_stopping_rounds=None,
save_snapshot=None, snapshot_file=None,
snapshot_interval=None, folds=None, type='Classical')
In [40]:
params = {'iterations': 100, 'verbose': False,
          'random_seed': 0,
          'loss_function':'Logloss',
          'eval_metric':'AUC',
          }

df_scores = catboost.cv(dtrain,
            params,
            fold_count=2,
            verbose=100,
            shuffle=True,
            stratified=True,
            plot=True) # plot does not work in google colab
0:	test: 0.9182109	best: 0.9182109 (0)	total: 227ms	remaining: 22.5s
99:	test: 0.9769374	best: 0.9792743 (56)	total: 17.9s	remaining: 0us
In [41]:
print(df_scores.columns)
df_scores.head()
Index(['iterations', 'test-AUC-mean', 'test-AUC-std', 'test-Logloss-mean',
       'test-Logloss-std', 'train-Logloss-mean', 'train-Logloss-std'],
      dtype='object')
Out[41]:
iterations test-AUC-mean test-AUC-std test-Logloss-mean test-Logloss-std train-Logloss-mean train-Logloss-std
0 0 0.918211 0.015632 0.585840 0.001246 0.585823 0.001171
1 1 0.922383 0.027860 0.500689 0.002353 0.500659 0.002239
2 2 0.933871 0.022411 0.425035 0.003157 0.425024 0.003205
3 3 0.928061 0.020897 0.365778 0.003360 0.365737 0.003457
4 4 0.939572 0.017085 0.310018 0.004005 0.309959 0.003970
In [42]:
fig, ax = plt.subplots(figsize=(12,8))
sns.lineplot(x='iterations',y='train-Logloss-mean',data=df_scores,ax=ax,color='r')
sns.lineplot(x='iterations',y='test-Logloss-mean',data=df_scores,ax=ax,
             color='b',alpha=0.2,linewidth=5,linestyle='--')

plt.show()

HPO (Hyper Parameter Optimization)

We should generally tune model complexity first and then tune convergence.

Model complexity: max_depth, etc.
Convergence: learning rate

Parameters:

  • learning_rate: step size shrinkage used to prevent overfitting. Range is [0,1]
  • depth: determines how deeply each tree is allowed to grow during any boosting round.
  • subsample: percentage of samples used per tree. Low value can lead to underfitting.
  • colsample_bytree: percentage of features used per tree. High value can lead to overfitting.
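The "complexity first, then convergence" order described above could be sketched as a two-stage search. Everything here is hypothetical: `toy_validation_score` is a made-up stand-in for "fit a model with these settings and score it on the validation set", and the candidate values are arbitrary.

```python
def toy_validation_score(depth, learning_rate):
    """Hypothetical stand-in for fitting a model and scoring it on validation data."""
    return 1.0 - 0.01 * (depth - 6) ** 2 - 2.0 * (learning_rate - 0.1) ** 2


def two_stage_search(score_fn, depths, learning_rates, initial_lr=0.3):
    # Stage 1: tune model complexity at a fixed, fairly large learning rate.
    best_depth = max(depths, key=lambda d: score_fn(d, initial_lr))
    # Stage 2: keep that complexity and tune the learning rate.
    best_lr = max(learning_rates, key=lambda lr: score_fn(best_depth, lr))
    return best_depth, best_lr


best_depth, best_lr = two_stage_search(
    toy_validation_score,
    depths=[4, 6, 8, 10],
    learning_rates=[0.03, 0.1, 0.3],
)
print(best_depth, best_lr)
```

The two-stage approach evaluates far fewer combinations than a full grid, at the risk of missing interactions between the two groups of parameters.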

Baseline model

In [110]:
model = joblib.load('../models/model_cat_default_seed100.joblib')
ypreds = model.predict(df_Xtest)
cm = confusion_matrix(ytest, ypreds)
print(cm)
[[56864     0]
 [   21    77]]
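The business question ("How many frauds are correctly classified?") can be read straight off this confusion matrix. The sketch below copies the printed values and derives the usual metrics from them; rows are true classes and columns are predicted classes, following scikit-learn's convention.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = [[56864, 0],
      [21, 77]]

tn, fp = cm[0]
fn, tp = cm[1]

n_frauds = tp + fn
recall = tp / n_frauds
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f'Frauds correctly classified: {tp} of {n_frauds}')
print(f'Recall={recall:.6f} Precision={precision:.6f} '
      f'F1={f1:.6f} Accuracy={accuracy:.6f}')
```

These values agree with the 'default, seed=100' row of the evaluation table shown earlier in the notebook.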

Using Early Stopping from Validation Set

In [103]:
%%time
params = dict(verbose=500,
              random_state=0,
              iterations=3_000,
              eval_metric='AUC',
              cat_features = [],
              early_stopping_rounds=200,
            )

model = catboost.CatBoostClassifier(**params)

model.fit(df_Xtrain, ytrain, 
          eval_set=(df_Xvalid, yvalid), 
          use_best_model=True, 
          plot=False
         );

# now use the best iteration
best_iter = model.get_best_iteration()

model = CatBoostClassifier(verbose=False,random_state=0,iterations=best_iter)
model.fit(df_Xtrain, ser_ytrain)
joblib.dump(model, '../models/model_cat_earlystopping.joblib')


ypreds = model.predict(df_Xtest)

cm = confusion_matrix(ytest, ypreds)
print(cm)

desc = f'early stopping, iterations={best_iter}'
df_eval = model_evaluation('catboost', desc, ytx,ypreds,df_eval=df_eval)

# using best iterations is worse, use previous 1000.
[[56863     1]
 [   23    75]]
Model Description Accuracy Precision Recall F1 AUC
0 catboost default,random_state=100, numpy 0.999456 0.913580 0.755102 0.826816 0.877489
1 catboost early stopping, iterations=885 0.999579 0.986842 0.765306 0.862069 0.882644
CPU times: user 2min 11s, sys: 6.69 s, total: 2min 18s
Wall time: 1min 11s
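The idea behind `early_stopping_rounds` and `use_best_model` can be sketched in plain Python. This is a simplified illustration with made-up validation scores; CatBoost's own overfitting detector may differ in detail, and `early_stopping_iteration` is a hypothetical helper.

```python
def early_stopping_iteration(valid_scores, patience):
    """Return (best_iteration, stop_iteration), both zero-based.

    Training stops once the validation score has not improved
    for `patience` consecutive iterations.
    """
    best_iter, best_score = 0, float('-inf')
    for i, score in enumerate(valid_scores):
        if score > best_score:
            best_iter, best_score = i, score
        elif i - best_iter >= patience:
            return best_iter, i
    return best_iter, len(valid_scores) - 1


# Made-up validation AUC per boosting iteration.
valid_auc = [0.90, 0.94, 0.97, 0.98, 0.975, 0.979, 0.978, 0.977]

best_iter, stop_iter = early_stopping_iteration(valid_auc, patience=3)
print(best_iter, stop_iter)
```

Note that the best iteration is a zero-based index, consistent with the earlier log where bestIteration = 14 is followed by "Shrink model to first 15 iterations".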
In [104]:
# for n in [6]: # default depth = 6

#     model = CatBoostClassifier(verbose=False,random_state=0,
#                               iterations=1_000,
#                               depth=n,
#                               )
#     model.fit(Xtr, ytr)
#     ypreds = model.predict(Xtx)
#     cm = confusion_matrix(ytest, ypreds)
#     error = cm[0,1] + cm[1,0]
#     print(f'Confusion matrix error count = {error} for n = {n}')

Try Your luck with different random states

In [105]:
# for n in [0]: 

#     model = CatBoostClassifier(verbose=False,random_state=n,
#                                depth=6,
#                               iterations=1_000,
#                               )
#     model.fit(Xtr, ytr)
#     ypreds = model.predict(Xtx)
#     cm = confusion_matrix(ytest, ypreds)
#     error = cm[0,1] + cm[1,0]
#     print(f'Confusion matrix error count = {error} for n = {n}')

HPO Hyper Parameter Optimization with Optuna

In [50]:
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING) # use INFO to see progress
In [51]:
def objective(trial):

    params_cat_optuna = {
        'objective': trial.suggest_categorical('objective', ['Logloss', 'CrossEntropy']),
        'colsample_bylevel': trial.suggest_uniform('colsample_bylevel', 0.01, 0.1),
        'depth': trial.suggest_int('depth', 1, 12),
        'boosting_type': trial.suggest_categorical('boosting_type', ['Ordered', 'Plain']),
        'bootstrap_type': trial.suggest_categorical('bootstrap_type',
                                                    ['Bayesian', 'Bernoulli', 'MVS']),
        'used_ram_limit': '3gb'
    }

    # update parameters
    if params_cat_optuna['bootstrap_type'] == 'Bayesian':
        params_cat_optuna['bagging_temperature'] = trial.suggest_uniform('bagging_temperature', 0, 10)
    elif params_cat_optuna['bootstrap_type'] == 'Bernoulli':
        params_cat_optuna['subsample'] = trial.suggest_uniform('subsample', 0.1, 1)
        
    # fit the model
    model = CatBoostClassifier(random_state=SEED,**params_cat_optuna)
    model.fit(df_Xtrain, ser_ytrain,
            eval_set=[(df_Xvalid, ser_yvalid)],
            verbose=0,
            early_stopping_rounds=100)
    
    ypreds = model.predict(df_Xvalid)
    ypreds = np.rint(ypreds)
    score = roc_auc_score(ser_yvalid.to_numpy().ravel(),
                              ypreds)
    return score
In [52]:
# NOTE: there is inherent non-determinism in optuna hyperparameter selection
#       we may not get the same hyperparameters when run twice.


sampler = optuna.samplers.TPESampler(seed=SEED)
N_TRIALS = 1 # make it large

study = optuna.create_study(direction='maximize',
                            sampler=sampler,
                            study_name='cat_optuna',
                            storage='sqlite:///cat_optuna_fraud_detection.db',
                            load_if_exists=True)

study.optimize(objective, n_trials=N_TRIALS,timeout=600)
In [53]:
# Resume from last time
sampler = optuna.samplers.TPESampler(seed=SEED)
N_TRIALS = 1 # make it large

study = optuna.create_study(direction='maximize',
                            sampler=sampler,
                            study_name='cat_optuna',
                            storage='sqlite:///cat_optuna_fraud_detection.db',
                            load_if_exists=True)

# study.optimize(objective, n_trials=N_TRIALS)
In [54]:
print(f'Number of finished trials: {len(study.trials)}')

# best trial
best_trial = study.best_trial

# best params
params_best = study.best_trial.params
params_best
Number of finished trials: 2
Out[54]:
{'bagging_temperature': 1.4860484007536512,
 'boosting_type': 'Plain',
 'bootstrap_type': 'Bayesian',
 'colsample_bylevel': 0.07040400702545975,
 'depth': 8,
 'objective': 'Logloss'}
In [106]:
%%time

model_name = 'catboost'
desc = 'grid search optuna'
Xtr = df_Xtrain_orig
ytr = ser_ytrain_orig.to_numpy().ravel()
Xtx = df_Xtest
ytx = ser_ytest.to_numpy().ravel()
Xvd = df_Xvalid
yvd = ser_yvalid.to_numpy().ravel()

# use best model
params_best =  study.best_trial.params

clf = CatBoostClassifier(random_state=SEED,verbose=False)
clf.set_params(**params_best)

# fit and save the model
clf.fit(Xtr, ytr)
joblib.dump(clf,'../models/clf_cat_grid_search_optuna.pkl')

# load the saved model
clf = joblib.load('../models/clf_cat_grid_search_optuna.pkl')

# predictions
ypreds = clf.predict(Xtx)

# model evaluation
cm = confusion_matrix(ytx, ypreds)
print(cm)

desc = 'grid search optuna'
df_eval = model_evaluation(model_name, desc, ytx, ypreds, df_eval=df_eval)
[[56862     2]
 [   22    76]]
   Model     Description                      Accuracy  Precision  Recall    F1        AUC
0  catboost  default,random_state=100, numpy  0.999456  0.913580   0.755102  0.826816  0.877489
1  catboost  early stopping, iterations=885   0.999579  0.986842   0.765306  0.862069  0.882644
2  catboost  grid search optuna               0.999579  0.974359   0.775510  0.863636  0.887738
CPU times: user 2min 19s, sys: 7.3 s, total: 2min 26s
Wall time: 1min 16s
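The table columns can be cross-checked by hand from the confusion matrix printed above. The sketch below recomputes them for the Optuna-tuned model. `metrics_from_cm` is a hypothetical helper, and reading the AUC column as the mean of recall and specificity is an inference from the numbers, because `model_evaluation` is defined elsewhere in the notebook.

```python
def metrics_from_cm(tn, fp, fn, tp):
    """Derive the evaluation-table columns from confusion-matrix counts."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # share of frauds that were caught
    specificity = tn / (tn + fp)     # share of non-frauds left alone
    f1 = 2 * precision * recall / (precision + recall)
    # ROC AUC of hard 0/1 predictions reduces to the mean of recall and specificity
    auc_hard = (recall + specificity) / 2
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'auc_hard': auc_hard,
    }

# counts taken from the printed matrix [[56862, 2], [22, 76]]
m = metrics_from_cm(tn=56862, fp=2, fn=22, tp=76)
for name, value in m.items():
    print(f'{name:10s} {value:.6f}')
```

For this model, 76 of the 98 frauds in the test set are correctly classified, which is the recall of 0.775510 in the table.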

Best Model

In [111]:
%%time
model = CatBoostClassifier(verbose=False,random_state=100,
                            depth=6,
                            iterations=1_000,
                            )
model.fit(Xtr, ytr)
joblib.dump(model, '../models/model_cat_best.joblib')

ypreds = model.predict(Xtx)
cm = confusion_matrix(ytx, ypreds)

print(cm)
df_eval = model_evaluation('catboost', 'seed=100,depth=6,iter=1k', ytx, ypreds, df_eval=df_eval)
[[56864     0]
 [   21    77]]
   Model     Description                      Accuracy  Precision  Recall    F1        AUC
0  catboost  default,random_state=100, numpy  0.999456  0.913580   0.755102  0.826816  0.877489
1  catboost  early stopping, iterations=885   0.999579  0.986842   0.765306  0.862069  0.882644
2  catboost  grid search optuna               0.999579  0.974359   0.775510  0.863636  0.887738
3  catboost  default, seed=100                0.999631  1.000000   0.785714  0.880000  0.892857
4  catboost  seed=100,depth=6,iter=1k         0.999631  1.000000   0.785714  0.880000  0.892857
CPU times: user 3min, sys: 7.99 s, total: 3min 8s
Wall time: 1min 36s

Model Interpretation

In [62]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([df_Xtrain.head(2), df_Xtest.head(2)])
Out[62]:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 Amount
35574 38177.0 1.430419 -0.718078 0.364706 -0.744257 -0.556090 0.698948 -0.949852 0.131008 -0.314353 0.512322 -0.202255 0.766648 1.495082 -1.037475 -1.935434 0.897715 0.069580 -0.902556 1.843867 0.158424 0.042013 0.429576 -0.301931 -0.933773 0.840490 -0.027776 0.044688 -0.007522 0.20
46862 42959.0 -2.425523 -1.790293 2.522139 0.581141 0.918453 0.594426 0.224541 0.373885 -0.168411 -0.720421 1.394710 1.136436 0.508455 -0.389067 -0.165166 -0.040520 -0.464966 -0.057803 -1.493635 0.984535 0.538438 0.877560 0.590595 -0.293545 0.524022 -0.328189 -0.205285 -0.109163 300.00
248750 154078.0 0.046622 1.529678 -0.453615 1.282569 1.110333 -0.882716 1.046420 -0.117121 -0.679897 -0.923709 0.371519 -0.000047 0.512255 -2.091762 0.786796 0.159652 1.706939 0.458922 0.037665 0.240559 -0.338472 -0.839547 0.066527 0.836447 0.076790 -0.775158 0.261012 0.058359 18.70
161573 114332.0 0.145870 0.107484 0.755127 -0.995936 1.159107 2.113961 0.036200 0.471777 0.627622 -0.598398 0.713816 1.091294 0.663878 -0.448057 0.146422 -0.445603 -0.462439 -0.373996 -0.966334 -0.107332 0.297644 1.285809 -0.140560 -0.910706 -0.449729 -0.235203 -0.036910 -0.227111 9.99

Model interpretation using eli5

In [63]:
import eli5

eli5.show_weights(model)
Out[63]:
Weight Feature
0.1009 V1
0.0653 V4
0.0641 V14
0.0604 V26
0.0542 Amount
0.0389 V12
0.0371 V15
0.0369 V10
0.0354 V11
0.0333 Time
0.0298 V8
0.0297 V19
0.0281 V13
0.0274 V7
0.0273 V20
0.0267 V2
0.0255 V3
0.0254 V22
0.0253 V16
0.0247 V18
… 10 more …
In [64]:
from eli5.sklearn import PermutationImportance

feature_names = df_Xtrain.columns.tolist()

perm = PermutationImportance(model).fit(df_Xtest, ytx)
eli5.show_weights(perm, feature_names=feature_names)
Out[64]:
Weight Feature
0.0008 ± 0.0000 V14
0.0003 ± 0.0000 V4
0.0002 ± 0.0000 V10
0.0001 ± 0.0001 V26
0.0001 ± 0.0001 Amount
0.0001 ± 0.0000 V28
0.0001 ± 0.0000 V12
0.0001 ± 0.0000 V17
0.0001 ± 0.0000 V16
0.0000 ± 0.0001 V1
0.0000 ± 0.0000 V22
0.0000 ± 0.0000 V19
0.0000 ± 0.0000 V27
0.0000 ± 0.0000 V20
0.0000 ± 0.0000 V8
0.0000 ± 0.0000 V6
0.0000 ± 0.0000 V3
0.0000 ± 0.0000 V5
0.0000 ± 0.0000 V7
0.0000 ± 0.0000 V25
… 10 more …
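Permutation importance shuffles one column at a time and records how much the score drops. The sketch below illustrates the idea on synthetic data with a hypothetical rule-based `predict` function; it is not eli5's implementation. Assuming the default scoring falls back to the estimator's accuracy-based `score` method, even important features can move the score only slightly on data this imbalanced, which is consistent with the small weights above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)        # only column 0 carries signal

def predict(X):
    """Toy stand-in for a fitted model: thresholds column 0."""
    return (X[:, 0] > 0).astype(int)

def accuracy(y_true, y_pred):
    return float((y_true == y_pred).mean())

def permutation_importance(X, y, n_iter=5):
    """Mean drop in accuracy after shuffling each column in turn."""
    base = accuracy(y, predict(X))
    drops = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        for _ in range(n_iter):
            X_perm = X.copy()
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops[col] += base - accuracy(y, predict(X_perm))
    return drops / n_iter

importances = permutation_importance(X, y)
print(importances)   # column 0 shows a large drop, column 1 shows none
```

A recall-based scorer would make the drops easier to read for this task, since recall is the metric stated in the business problem.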

Model interpretation using lime

In [65]:
import lime
import lime.lime_tabular
In [66]:
idx = 0
example = df_Xtest.iloc[idx]
answer = ser_ytest.iloc[idx]
feature_names = df_Xtest.columns.tolist()

prediction = model.predict(example.to_numpy().reshape(1, -1))


print(f'answer     = {answer}')
print('prediction = ', prediction[0])
print()
print(example)
print(feature_names)
answer     = 0
prediction =  0

Time      154078.000000
V1             0.046622
V2             1.529678
V3            -0.453615
V4             1.282569
V5             1.110333
V6            -0.882716
V7             1.046420
V8            -0.117121
V9            -0.679897
V10           -0.923709
V11            0.371519
V12           -0.000047
V13            0.512255
V14           -2.091762
V15            0.786796
V16            0.159652
V17            1.706939
V18            0.458922
V19            0.037665
V20            0.240559
V21           -0.338472
V22           -0.839547
V23            0.066527
V24            0.836447
V25            0.076790
V26           -0.775158
V27            0.261012
V28            0.058359
Amount        18.700000
Name: 248750, dtype: float64
['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount']
In [67]:
import lime
import lime.lime_tabular

categorical_features = []
categorical_features_idx = [df_Xtrain.columns.get_loc(col) for col in categorical_features]


explainer = lime.lime_tabular.LimeTabularExplainer(df_Xtrain.to_numpy(), 
               feature_names=feature_names, 
               class_names=['Not-fraud','Fraud'], 
               categorical_features=categorical_features_idx, 
               mode='classification')

# explain_instance expects a 1-D numpy array, so convert the pandas Series
exp = explainer.explain_instance(example.to_numpy(), model.predict_proba, num_features=8)
exp.show_in_notebook(show_table=True)
In [68]:
exp.as_pyplot_figure(); # trailing semicolon suppresses the cell's return value
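LIME explains a single prediction by sampling points around the instance, weighting them by proximity, and fitting a simple weighted linear model to the black-box outputs. The following is a rough sketch of that idea, assuming a toy two-feature `black_box` function and a hand-picked kernel width. It is not the lime library's actual procedure, which also scales and discretises features.

```python
import numpy as np

def black_box(X):
    """Toy probability model: feature 0 raises the score, feature 1 lowers it."""
    return 1.0 / (1.0 + np.exp(-(3.0 * X[:, 0] - 1.0 * X[:, 1])))

rng = np.random.default_rng(0)
instance = np.array([0.0, 0.0])

# 1. sample perturbed points around the instance
Z = instance + rng.normal(scale=0.1, size=(1000, 2))

# 2. weight each sample by its proximity to the instance (Gaussian kernel)
dist = np.linalg.norm(Z - instance, axis=1)
weights = np.exp(-(dist ** 2) / (2 * 0.25 ** 2))

# 3. fit a weighted linear surrogate: f(z) ~ intercept + coefs . z
A = np.column_stack([np.ones(len(Z)), Z])
sw = np.sqrt(weights)
beta, *_ = np.linalg.lstsq(A * sw[:, None], black_box(Z) * sw, rcond=None)
intercept, coefs = beta[0], beta[1:]

print(coefs)   # local effect of each feature near the instance
```

The signs and sizes of `coefs` play the role of the bars in the LIME plot: positive values push towards the explained class, negative values push away from it.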

Model interpretation using shap

In [70]:
import shap
shap.initjs()
In [71]:
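# The training log shown below appears to come from an earlier run that refit the model here:
# model = CatBoostClassifier(verbose=100,random_state=100)
# model.fit(df_Xtrain, ytrain)
# The saved best model is loaded instead of refitting.
model = joblib.load('../models/model_cat_best.joblib')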

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(df_Xtest)
Learning rate set to 0.095119
0:	learn: 0.4064239	total: 84.1ms	remaining: 1m 24s
100:	learn: 0.0014365	total: 8.01s	remaining: 1m 11s
200:	learn: 0.0009553	total: 15.9s	remaining: 1m 3s
300:	learn: 0.0006789	total: 23.9s	remaining: 55.4s
400:	learn: 0.0004701	total: 31.9s	remaining: 47.6s
500:	learn: 0.0003296	total: 39.7s	remaining: 39.6s
600:	learn: 0.0002371	total: 47.7s	remaining: 31.7s
700:	learn: 0.0001719	total: 55.5s	remaining: 23.7s
800:	learn: 0.0001354	total: 1m 3s	remaining: 15.8s
900:	learn: 0.0001058	total: 1m 11s	remaining: 7.85s
999:	learn: 0.0000872	total: 1m 19s	remaining: 0us
In [72]:
df_Xtest.head(1)
Out[72]:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 Amount
248750 154078.0 0.046622 1.529678 -0.453615 1.282569 1.110333 -0.882716 1.04642 -0.117121 -0.679897 -0.923709 0.371519 -0.000047 0.512255 -2.091762 0.786796 0.159652 1.706939 0.458922 0.037665 0.240559 -0.338472 -0.839547 0.066527 0.836447 0.07679 -0.775158 0.261012 0.058359 18.7
In [73]:
df_Xtest.head(1)['V15 V18 V3 V24 V1 V8 V4 V14 V2 V6 V9 V20'.split()].round(4)
Out[73]:
V15 V18 V3 V24 V1 V8 V4 V14 V2 V6 V9 V20
248750 0.7868 0.4589 -0.4536 0.8364 0.0466 -0.1171 1.2826 -2.0918 1.5297 -0.8827 -0.6799 0.2406
In [74]:
# Explain only the first row of the test data.
# Pass matplotlib=True for a static plot that does not need JavaScript.
idx = 0
shap.force_plot(explainer.expected_value,
                shap_values[idx,:],
                df_Xtest.iloc[idx,:],
                matplotlib=False,
                text_rotation=90)

# For this row the model output is -9.33. This is the raw score
# (log-odds for a Logloss model), not a class label.
# Red features push the output higher;
# blue features push it lower.
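For a tree model explained in log-odds space, the force plot reads as: base value plus the per-feature SHAP contributions equals the raw model output. The small worked example below assumes the -9.33 quoted in the cell's comment is such a log-odds score (the plot itself is not reproduced here) and converts it to a probability.

```python
import math

def sigmoid(x):
    """Map a log-odds score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

raw_output = -9.33                       # value quoted for the first test row
fraud_probability = sigmoid(raw_output)
print(f'{fraud_probability:.6f}')        # roughly 0.00009

# Additivity check that could be run in the notebook (names from the cells above):
#   explainer.expected_value + shap_values[idx, :].sum()
# should be close to the raw output for row idx.
```

A probability this small is consistent with the Not-fraud prediction for this row.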
In [75]:
NUM = 100
shap.force_plot(explainer.expected_value, shap_values[:NUM,:],
                df_Xtest.iloc[:NUM,:],matplotlib=False)
In [76]:
shap.summary_plot(shap_values, df_Xtest)
In [77]:
shap.summary_plot(shap_values, df_Xtest, plot_type='bar')
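The bar variant of the summary plot ranks features by their mean absolute SHAP value. The sketch below shows that aggregation on a small made-up SHAP matrix with illustrative feature names, rather than on the notebook's `shap_values`.

```python
import numpy as np

feature_names = ['V14', 'V4', 'Amount']      # illustrative subset
toy_shap = np.array([[-2.0,  0.5,  0.1],
                     [ 1.0, -0.5, -0.1],
                     [-3.0,  1.0,  0.1]])     # rows = samples, columns = features

# average magnitude of each feature's contribution, ignoring direction
mean_abs = np.abs(toy_shap).mean(axis=0)
order = np.argsort(mean_abs)[::-1]            # most important first
ranking = [(feature_names[i], float(mean_abs[i])) for i in order]
print(ranking)
```

In the notebook, the same ranking could be obtained by pairing `np.abs(shap_values).mean(axis=0)` with `df_Xtest.columns`.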
In [78]:
shap.dependence_plot("Amount", shap_values, df_Xtest)
In [79]:
shap.dependence_plot(ind='Time', interaction_index='Amount',
                     shap_values=shap_values, 
                     features=df_Xtest,  
                     display_features=df_Xtest)

Time Taken

In [80]:
notebook_end_time = time.time()
time_taken = notebook_end_time - notebook_start_time
h, rem = divmod(time_taken, 60*60)  # rem = seconds left after whole hours
print('Time taken to run whole notebook: {:.0f} hr {:.0f} min {:.0f} secs'.format(h, *divmod(rem, 60)))
Time taken to run whole notebook: 0 hr 22 min 40 secs